ENHANCED RAID WRITE HOLE
PROTECTION AND RECOVERY

CROSS-REFERENCES TO RELATED 
APPLICATIONS 

This application is related to co-pending Application Ser.
No. 08/542,827, "A RAID ARRAY DATA STORAGE SYSTEM WITH
STORAGE DEVICE METADATA AND RAID SET METADATA," assigned
to the assignee of this invention and filed on even date
herewith.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to RAID (Redundant Arrays 
of Independent Disks) architecture and systems and more 
particularly relates to methods and apparatus for enhancing 
RAID write hole protection and recovery from system 
failures which would normally leave a "write hole". 

2. Description of Related Art 

The last twenty years have witnessed a revolution in
information technology. Information handling and processing
problems once thought to be insoluble have been solved, and
the solutions have become part of our daily lives. On-line
credit card processing, rapid and reliable travel
reservations, completely automated factories, vastly
improved weather forecasting through satellites, worldwide
Internet communications and many other breakthroughs all
represent formidable computing challenges, yet all have
become commonplace in our lives.

These challenges have been met and mastered in large 
part because of the rate of technological progress in the 
components that comprise computer systems: 

Computers themselves have decreased in size from huge 
racks of equipment to units that can be placed on or beside 
a desk. While they have been shrinking physically, their 
capability to deliver cost effective high-performance com- 
puting has been doubling about every three years. 

Graphical capabilities have progressed from the simple 
fixed-font character cell display to high-resolution multi- 
color displays, with advanced features such as three dimen- 
sional hardware assist becoming increasingly common. 

Networks have proliferated. Low-cost, easily accessible
communications capacity has grown from thousands of bits
per second to tens of millions of bits per second, with
billions of bits per second capability starting to appear on
the horizon.

The shrinking computer has combined with the local area 
network to create client-server computing, in which a small 
number of very powerful server computers provide storage, 
backup, printing, wide area network access, and other ser- 
vices for a large number of desktop client computers. 

The capabilities of the bulk data storage for these
machines are equally impressive. A billion bytes of
magnetic storage, which thirty years ago required as much
space and electrical power as a room full of refrigerators,
can now be easily held in one's hand.

With these capabilities has come a dependence upon the 
reliable functioning of computer systems. Computer systems 
consisting of components from many sources installed at 
many locations are routinely expected to integrate and work 
flawlessly as a unit. 

Redundant Arrays of Independent Disks, or RAID, technology
is sweeping the mass storage industry. Informed estimates
place its expected usage rate at 40% or more of all storage
over the next few years.

There are numerous RAID techniques. They are briefly
outlined below. A more thorough and complete understanding
may be had by referring to "The RAIDbook, A Source Book for
Disk Array Technology", the fourth edition of which was
published by the RAID Advisory Board (RAB™), St. Peter,
Minn.

The two most popular RAID techniques employ either a
mirrored array of disks or a striped data array of disks. A
RAID that is mirrored presents very reliable virtual disks
whose aggregate capacity is equal to that of the smallest of
its member disks and whose performance is usually
measurably better than that of a single member disk for
reads and slightly lower for writes.

A striped array presents virtual disks whose aggregate
capacity is approximately the sum of the capacities of its
members, and whose read and write performance are both
very high. The data reliability of a striped array's virtual
disks, however, is less than that of the least reliable
member disk.

Disk arrays may enhance some or all of three desirable
storage properties compared to individual disks:

They may improve I/O performance by balancing the I/O load
evenly across the disks. Striped arrays have this property,
because they cause streams of either sequential or random
I/O requests to be divided approximately evenly across the
disks in the set. In many cases, a mirrored array can also
improve read performance because each of its members can
process a separate read request simultaneously, thereby
reducing the average read queue length in a busy system.

They may improve data reliability by replicating data so
that it is not destroyed or made inaccessible if the disk on
which it is stored fails. Mirrored arrays have this
property, because they cause every block of data to be
replicated on all members of the set. Striped arrays, on the
other hand, do not, because as a practical matter the
failure of one disk in a striped array renders all the data
stored on the array's virtual disks inaccessible.

They may simplify storage management by treating more
storage capacity as a single manageable entity. A system
manager who is managing arrays of four disks (each array
presenting a single virtual disk) has one fourth as many
directories to create, one fourth as many user disk space
quotas to set, one fourth as many backup operations to
schedule, etc. Striped arrays have this property, while
mirrored arrays generally do not.
With respect to classification (sometimes referred to as 
levels), some RAID levels are classified by the RAID 
Advisory Board (RAB™) as follows: 

Very briefly, RAID 0 employs striping, that is, distributing
data across the multiple disks of an array. No redundancy of
information is provided, but data transfer capacity and
maximum I/O rates are very high.

In RAID level 1, data redundancy is obtained by storing
exact copies on mirrored pairs of drives. RAID 1 uses twice
as many drives as RAID 0 and has a better data transfer rate
for reads than a single disk, but about the same rate for
writes.

In RAID 2, data is striped at the bit level. Multiple error
correcting disks (data protected by a Hamming code) provide
redundancy and a high data transfer capacity for both read
and write, but because multiple additional disk drives are
necessary for implementation, RAID 2 is not a commercially
implemented RAID level.

In RAID level 3, each data sector is subdivided and the data
is striped, usually at the byte level, across the disk
drives, and one drive is set aside for parity information.
Redundant information is stored on a dedicated parity disk.
Data transfer rates are very high for both read and write
I/O.

In RAID level 4, data is striped in blocks, and one drive is
set aside for parity information.

In RAID 5, data and parity information are striped in blocks
and rotated among all drives in the array.

Because RAID 5 is the RAID level of choice, it shall be used
in the following discussion. However, much of what follows
is applicable to other RAID levels, including RAID levels 6
et seq., not discussed above. The invention is particularly
applicable to RAID levels employing a parity or ECC form of
data redundancy and/or recovery.

RAID 5 uses a technique that (1) writes a block of data
across several disks (i.e. striping), (2) calculates an
error correction code (ECC, i.e. parity) at the bit level
from this data and stores the code on another disk, and (3)
in the event of a single disk failure, uses the data on the
working drives and the calculated code to "interpolate" what
the missing data should be (i.e. rebuilds or reconstructs
the missing data from the existing data and the calculated
parity). A RAID 5 array "rotates" data and parity among all
the drives on the array, in contrast with RAID 3 or 4, which
store all calculated parity values on one particular drive.
The following is a simplified example of how RAID 5
calculates ECCs (error correction codes, commonly referred
to as parity) and restores data if a drive fails.

By way of example, and referring to the prior art drawing of
FIG. 1, assume a five-drive disk array or RAID set on which
it is intended to store four values, e.g. the decimal
numbers 150, 112, 230, and 139. In this example, the decimal
number 150 is binary 10010110 and is to be written on disk
1, the number 112 is binary 01110000 and is to be written on
disk 2, the number 230 is binary 11100110 and is to be
written on disk 3, and the number 139 is binary 10001011 and
is to be written on disk 4. When the four values are written
to disks 1-4, the RAID controller examines the sum of each
bit position, in what is called a slice. If the sum of the
bit position is an odd number, then a 1 is assigned as the
parity bit; if the sum is an even number, then a 0 is
assigned. (It should be noted that if a reconstruct
algorithm (RW) is employed, the parity may be calculated
prior to any writes and that writes are done essentially in
parallel and simultaneously. Thus when the calculation for
parity is accomplished is primarily a function of the choice
of algorithm as well as its precise implementation in
firmware.) Another way in which parity is determined, as
will be more fully exemplified below, is to exclusive OR
(Xor) the first two consecutive bits of a slice, Xor the
result with the next bit and so on, the Xor with the final
bit of the last data-carrying drive being the parity.
However, it should be recognized that the process of
determining parity is commutative, that is, the order of
Xoring is unimportant.

Assume disk 2 fails, in the example of FIG. 1. In that
event, the following occurs. The RAID disk controller can no
longer ascertain that the value of bit 7 is a 0 on disk 2.
However, the controller knows that its value can be only a 0
or a 1. Inasmuch as disks 1, 3, 4 and 5 are still operating,
the controller can perform the following calculation:
1+?+1+1 = an odd number, or 1. Since 1+0+1+1 = an odd
number, the missing value on disk 2 must be a 0. The RAID
controller then performs the same calculation for the
remaining bit positions on disk 2. In this way data missing
due to a drive failure may be rebuilt. Another, and often
more convenient, way of determining the missing value, i.e.
a 0 or a 1, is by Xoring the parity with the data bit on
drive #1, Xoring the result with the data bit on drive #3
and then Xoring that result with the data bit on drive #4.
The result is the missing data bit which is attributable to
that missing bit in slice 1 on drive 2. This activity
continues for each bit of the block. Once again, inasmuch as
the process of determining parity is commutative, that is,
the order of Xoring is unimportant, the order of making the
determination is unimportant.

This reconstruction is accomplished by the RAID controller,
in conjunction with array management software, which
examines the sum of each bit position to assign an even or
an odd number to disk 5. If a disk fails, a 0 or a 1 is
assigned to the missing value and a simple calculation is
performed. The missing bit is the Xor of the members
including parity. This process is repeated, and the data is
rebuilt. If a disk drive (#2 in the example) has failed, and
information on that disk is called for by the user, the data
will be built on the fly and placed into memory until a
replacement drive may be obtained. In this manner, no data
is lost. By way of definition, "consistent parity" is the
parity as recorded on the media which is the Xor of all the
data bits as recorded on the media. It should be understood
that in the event that the data from one of the members
becomes unavailable, that data can be reconstructed if the
parity is consistent.

A write hole can occur when a system crashes or there is a
power loss with multiple writes outstanding to a device or
member disk drive. One write may have completed but not all
of them, resulting in inconsistent parity. Prior solutions
have been to reconstruct the parity for every block. Since
this is accomplished at the bit level, as may be imagined,
this can be very time consuming if billions of bits are
involved.

SUMMARY OF THE INVENTION

In view of the above, it is a principal object of the
present invention to provide a technique, coupled with
apparatus, for inhibiting write hole formation while
facilitating recovery of "write holes".

Another object of the present invention is to provide a
method of marking new data prior to its being written as
well as storing old data which permits localized
reconstruction of data without having to reconstruct the
entire array of disks when a crash or failure occurs.

Yet another object of the present invention is to provide
means for quickly identifying possible bad data blocks and
write holes and permitting the most expeditious recovery,
including possible reconstruction from precautionary
measures taken prior to the crash or failure of the system.

These and other objects are facilitated by employing a
nonvolatile write back cache and by storing the metadata
information in nonvolatile memory as well. Cached metadata
includes at least three fields. One field contains the LBN
(and thus the Logical Block Address [LBA]) of the material
to be written, another field contains the device ID, and the
third field contains the block status (dirty, clean,
inconsistent parity on slice, parity valid). [Hereinafter,
crash = an unintentional power or function cessation, which
could be from a controller, cache, memory drive, computer
system etc. unexpectedly ceasing operation due to power
loss, failure etc.] In the instance where a crash occurs
during a "write" to disk, it is possible the "write" wrote
some, but not all, of the data blocks and the parity block.
This results in inconsistent parity across the slice. Since
a write-back cache which is non-volatile is employed, the
data that was to be written is still retained, and by use of
the metadata information, which is also saved in
non-volatile memory, where the write was intended is known.
Thus the write that was occurring when the crash took place
can now be reconstructed and it can be ensured that the
parity is consistent. In this manner only the blocks
affected by the crash need be corrected, and parity for
every block does not have to be recalculated.
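
As an illustration only, the slice parity arithmetic of the
FIG. 1 example and the localized repair described in the
preceding paragraph can be sketched in a few lines of
Python. This is a minimal sketch under stated assumptions,
not the firmware of the illustrative embodiment: the names
slice_parity, rebuild_missing, CacheMetadata and
repair_after_crash are hypothetical, each member disk is
modeled as a list of integers, and the LBN field is used
directly as the row index shared by every member.

```python
from dataclasses import dataclass
from functools import reduce
from operator import xor


def slice_parity(values):
    """Xor all data values of a row; the order of Xoring does not matter."""
    return reduce(xor, values, 0)


def rebuild_missing(surviving_values, parity):
    """Recover the one missing member from the survivors plus the parity."""
    return reduce(xor, surviving_values, parity)


@dataclass
class CacheMetadata:
    """Hypothetical record holding the three fields named in the text."""
    lbn: int        # block number; in this toy model also the row index
    device_id: int  # index of the member disk the block belongs to
    status: str     # "dirty", "clean", "inconsistent" or "parity_valid"


def repair_after_crash(disks, parity_disk, cache, metadata):
    """Redo only the writes that the metadata marks as unfinished."""
    repaired = []
    for entry in metadata:
        if entry.status in ("dirty", "inconsistent"):
            # The nonvolatile cache still holds the data that was to be written.
            disks[entry.device_id][entry.lbn] = cache[(entry.device_id, entry.lbn)]
            # Recompute parity for this one row only, not for the whole array.
            parity_disk[entry.lbn] = slice_parity(d[entry.lbn] for d in disks)
            entry.status = "clean"
            repaired.append(entry.lbn)
    return repaired
```

With the FIG. 1 values, slice_parity([150, 112, 230, 139])
gives 139 (binary 10001011), and rebuild_missing([150, 230,
139], 139) gives back 112, the value lost with disk 2. The
repair loop touches only the rows named in the metadata,
which is the point of the technique: no parity outside the
affected rows is recalculated.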

Other objects and a more complete understanding of the 
invention may be had by referring to the following descrip- 
tion taken in conjunction with the accompanying drawings 
in which: 

BRIEF DESCRIPTION OF THE DRAWING(S) 

FIG. 1 is a table illustrating how parity is determined in 
a five disk drive array for a RAID set; 

FIG. 2 is a table showing a five disk drive array for a 
RAID level 5 architectured RAID set, with striping of data 
and parity, and including the hidden or metadata blocks of 
information for both devices and the RAID set; 

FIG. 3 is a block diagram illustrating a preferred redun- 
dant controller and cache arrangement which may be 
employed in accordance with the present invention; 

FIG. 4 is a diagrammatic representation of sample data 
mapping for a RAID level 5 array, illustrating two block 
high data striping with interleaved parity and left symmetric 
parity rotation, and which also shows the relationship 
between the "array management software", the virtual disk 
and the member disks of the array; 

FIG. 5 is a table depicting how parity may change with the 
writing of new data into an array member; 

FIG. 6 is a flow chart illustrating the steps employed for 
a Read, Modify and Write (RMW) Algorithm using the 
novel techniques of the present invention to inhibit the 
formation of write holes; 

FIG. 7 is a table illustrating how either technique, the RMW
or the reconstruct write (RW) algorithm, may be employed in
accordance with the invention; and

FIG. 8 is a flow diagram of the RW algorithm modified 
with the teachings of the present invention. 

DESCRIPTION OF THE ILLUSTRATIVE 
EMBODIMENT 

Referring now to the drawings and especially FIG. 2, a
typical five member RAID Array 10 is depicted in the table,
each member disk drive being labeled #1-#5 respectively
across the top of the table. As was discussed above with
reference to FIG. 1, a slice extends, at the bit level,
across the members or disk drives of the Array. In a RAID 5
device, the data is placed on the disks in blocks, for
example each of 512 bytes, and each block is given a Logical
Block Number (LBN). For example, as shown in the table of
FIG. 2, one row holds block 0 on drive #1, block 4 on drive
#2, block 8 on drive #3, block 12 on drive #4, and parity in
an unnumbered block on drive #5. (Note that each block that
contained the error code information [parity] would be
numbered with a SCSI [Small Computer System Interface]
number, all blocks in a single drive being numbered
consecutively, but for purposes of discussion and
simplification the parity blocks have not been given an
assignment number herein. Thus the numbered blocks, i.e.
those numbered in the table and labeled LBN, represent
data.) A single "slice" would be a single bit of each of
blocks 0, 4, 8, 12 and a parity bit. Accordingly there would
be 512 bytes times 8 bits/byte, or 4096, slices of bit size
data. A "chunk" is defined as the number of blocks placed on
one drive before the next LBN block of data is written on
the next drive. In the illustrated instance in FIG. 2, a
chunk includes 4 blocks, and a "strip", while capable of
being defined a number of ways, is shown as being 4 blocks
in depth times 5 drives, or really 16 blocks of data plus 4
blocks of parity information. (In the example or sample
shown in FIG. 4, a chunk is equal to two blocks; a strip is
equal to 8 data blocks and two parity blocks.) Thus strip 0
in FIG. 2 includes user data blocks, or LBNs, 0-15 plus four
parity blocks, while strip 1 includes LBNs 16-31 plus four
parity blocks, and strip 2 includes LBNs 32-47 with four
blocks of parity, and so on. As is shown in the abbreviated
table of FIG. 2, the RAID is called a "left symmetric
rotate" because the four high (sometimes referred to as
"deep") parity blocks move around from drive #5 in strip 0
to drive #4 in strip 1 to drive #3 in strip 2 and so on.
Notice also that the LBN continues its numbering sequence in
the same spiral fashion, LBNs 16-19 being written in strip 1
to drive #5, and continuing through drive #1 (LBNs 20-23),
through LBN 31 appearing in drive #3.
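
The layout just described can be expressed as a small
mapping from a user LBN to its strip, data drive and parity
drive. The following Python sketch is an illustration
inferred from the FIG. 2 description (five drives, chunks of
four blocks, parity moving one drive to the left per strip);
the function name locate_lbn and its parameters are
hypothetical and do not appear in the figures.

```python
def locate_lbn(lbn, drives=5, chunk=4):
    """Map a user LBN to (strip, data drive, parity drive).

    Drives are numbered from 1, as in FIG. 2. The defaults model the
    FIG. 2 layout and are assumptions of this sketch.
    """
    data_blocks_per_strip = chunk * (drives - 1)
    strip = lbn // data_blocks_per_strip
    # Parity starts on the last drive and moves one drive left per strip.
    parity_index = (drives - 1 - strip) % drives
    # Data chunks start on the drive after the parity drive and wrap around.
    chunk_in_strip = (lbn % data_blocks_per_strip) // chunk
    data_index = (parity_index + 1 + chunk_in_strip) % drives
    return strip, data_index + 1, parity_index + 1
```

For instance, locate_lbn(16) returns (1, 5, 4): LBN 16 lies
in strip 1 on drive #5, with that strip's parity on drive
#4, which agrees with the description above. Likewise
locate_lbn(31) returns (1, 3, 4).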

Every disk drive carries certain extra information on it to
permit identification of the disk drive, as well as other
pertinent information. Each drive, at the lower extent of
the table in FIG. 2, includes device specific information,
including an ID block, an FE block, and an FE-Dir block.
Note that the total lengths of the blocks under each drive
are representative of the size or capacity of each drive.
For example, disk drive #1 is the smallest drive, disk
drives #2-#4 are the same size, and disk drive #5 is the
largest capacity drive. Because drives are interchangeable,
and a large drive can be used in an array with smaller
capacity drives, a substitute or spare disk drive can be
installed in the RAIDset as long as the substitute drive has
at least as much capacity as the smallest drive in the
RAIDset at the time of its creation. The rest of the spaces
are not used by the user, and are considered and labeled
"Fill Space**".
Turning now to the device specific information carried by
each of the drives, and as illustrated in FIG. 2, the lowest
boxes labeled "ID" and associated with each of Drives #1-#5
representatively contain such information as RAID membership
information (order, serial number of all members, EDC on ID
information to protect metadata format). This information is
contained in data blocks on the disk and is readable by the
disk controller. It is included as part of the device or
member metadata illustrated in FIG. 2. Part of the device or
member metadata is a forced error box labeled FE, one such
box being associated with each disk drive #1-#5. The FE box
represents one or more blocks of data. Within the FE blocks
is a single bit per device data block, i.e. LBN and parity.
In other words, each data block on each of the drives has an
FE bit. The single bit represents whether the associated
block can be reliably used in Xor calculations. If the bit
is 'set' (e.g. a "1"), as will be explained hereinafter, the
data block is considered bad as to its reliability, and
therefore the block, or any of its bits, cannot be used for
Xor calculations. There are enough blocks of FE bits to
represent all blocks in the device, including the "Fill
Space" of larger devices, but not the metadata itself. The
third and last box, which includes, especially with larger
drives, several blocks of data written on each disk or
device, is labelled "Forced Error Directory Blocks", or
FE-Dir. Each FE-Dir block contains 1 bit for every block
(512 bytes in our example) of forced error bits. This block
is used for a quick lookup of suspect data or parity blocks.
If a bit is not set, then there are no forced errors in the
Logical Block Address (LBA) range the FE block covers. For
faster lookup, the FE-Dir information is cached in the
controller cache (described hereinafter).
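
The two-level lookup formed by the FE bits and the FE-Dir
bits can be illustrated as follows. This is a minimal
in-memory sketch, assuming 512-byte FE blocks so that one FE
block covers 4096 device blocks; the class ForcedErrorMap
and its method names are hypothetical, and the on-disk
format of the metadata is not modeled.

```python
BITS_PER_FE_BLOCK = 512 * 8  # one 512-byte FE block holds 4096 FE bits


class ForcedErrorMap:
    """Toy model of per-device FE bits with an FE-Dir summary bit per FE block."""

    def __init__(self, total_blocks):
        self.fe_bits = [False] * total_blocks
        n_fe_blocks = -(-total_blocks // BITS_PER_FE_BLOCK)  # ceiling division
        self.fe_dir = [False] * n_fe_blocks

    def set_forced_error(self, block):
        """Mark a block as unusable for Xor and flag its FE block in the directory."""
        self.fe_bits[block] = True
        self.fe_dir[block // BITS_PER_FE_BLOCK] = True

    def clear_forced_error(self, block):
        """Clear a block's FE bit; clear the directory bit if the range is now clean."""
        self.fe_bits[block] = False
        fe_block = block // BITS_PER_FE_BLOCK
        start = fe_block * BITS_PER_FE_BLOCK
        self.fe_dir[fe_block] = any(self.fe_bits[start:start + BITS_PER_FE_BLOCK])

    def usable_for_xor(self, block):
        """Consult the directory first; read the FE bit only when the range is flagged."""
        if not self.fe_dir[block // BITS_PER_FE_BLOCK]:
            return True
        return not self.fe_bits[block]
```

The directory check lets a lookup skip the FE block
entirely when its range contains no forced errors, which is
the quick lookup the text attributes to FE-Dir.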

One other group of blocks which exist and are written to the
disks are those containing RAIDed metadata. These blocks
contain much the same information as the individual device
metadata blocks of information, with the exception that
since they are RAIDed, parity is employed so that
information can also be recovered here if a block of data or
parity as such is considered bad, or a drive is removed,
etc. The ID box in the RAIDed Metadata representatively
contains RAIDset information (serial number, size), and EDC
on ID information to protect the metadata format. The RAIDed
forced error "FE" bits still employ 1 bit per RAIDed user
data block, representing whether a block is suspect and its
data cannot be relied upon. (This does not mean it is
unusable, as with the device specific metadata FE block,
only unreliable or suspect.) The Forced Error Directory box,
which representatively may contain multiple disk blocks,
contains 1 bit per disk block of forced error bits. Like its
device specific partner FE-Dir, it is used for quick lookup
of suspect blocks. Moreover, if there is any fill space in
the RAIDed Metadata strip, this fill will take up the extra
space to the stripe boundary. These blocks have been labeled
Fill Space* (note the single asterisk to denote RAIDed area
fill space).

Forced error promotion occurs when two blocks within a strip
have unrecoverable read errors or corresponding device FE
bits set. For example, suppose block 6 and the parity block
in strip 0 are unreadable. This means that there is no
longer redundancy for blocks 2, 10 and 14. In order to
regain redundancy the following procedure may be employed:
write the RAIDed FE bit corresponding to the user data block
lost in strip 0 (if there were more data blocks lost, write
0 to all lost data blocks); calculate parity, including the
zeroed block just written and good data blocks 2, 10 and 14;
write the parity block; and clear the parity block's device
FE and the lost data block's device FE. The RAIDed FE bit
set denotes that block 6 has been lost. A subsequent write
of block 6 will write correct data with full redundancy and
clear the RAIDed FE bit. It is during the time from when
block 6 has been lost to the point that it has again been
written with new good data that forced error promotion in
RAID is quite useful. FE promotion allows the remaining
'good' blocks of data to be provided with full RAID-5 data
integrity even in the face of multiple block errors existing
within the strip of a RAID set.

For other information on fast initialization and additional
material on metadata, and for a more complete understanding
of the RAIDed Metadata, patent application Ser. No.
08/542,827, filed on even date herewith, entitled "A RAID
ARRAY DATA STORAGE SYSTEM WITH STORAGE DEVICE METADATA AND
RAID SET METADATA" and assigned to the assignee of this
invention, is hereby incorporated by reference.

Turning now to FIG. 3, a system block diagram is shown
including a dual-redundant controller configuration 20. The
configuration may contain StorageWorks™ components,
HS-series controllers manufactured by Digital Equipment
Corporation. As shown, each controller 21, 22 is connected
to its own cache 23 and 24, each with bidirectional lines
25, 26 and 27, 28. Each controller 21, 22 is connected in
turn through I/O lines 29 and 30 respectively to a host
interface 35, which may include a backplane for power-supply
and bus connections and other things as discussed below. One
or more host computers or CPUs 31 and 32 may be connected to
the host interface 35. The backplane includes
intercontroller communication, control lines between the
controllers and shared SCSI-2 device ports such as shown
schematically at 33 and 34. Since the two controllers share
SCSI-2 device ports, the design enables continued device
availability if either controller fails.

Although not shown in the drawing, the commercial unit
mentioned above includes in the backplane a direct
communication path between the two controllers by means of a
serial communication universal asynchronous
receiver/transmitter (UART) on each controller. The
controllers use this communication link to inform one
another about controller initialization status. In a
dual-redundant configuration, such as the configuration 20
shown in FIG. 3, a controller that is initializing or
reinitializing sends information about the process to the
other controller. Controllers send keep alive messages to
each other at timed intervals. The cessation of
communication by one controller causes a "failover" to occur
once the surviving controller has disabled the other
controller. In a dual-redundant configuration, if one
controller fails, all attached storage devices continue to
be served. This is called "failover". Failover occurs, as
has been previously mentioned, because the controllers in a
dual-redundant configuration share SCSI-2 device ports and
therefore access to all attached storage devices. If
failover is to be achieved, the surviving controller should
not require access to the failed controller. The two way
failover communication line 37 is depicted in FIG. 3.

StorageWorks™ controllers in a dual-redundant configuration
have the same information at all times. When information is
entered into one controller, that controller sends the new
information to the other controller. Each controller stores
this information in a controller resident nonvolatile
memory. If one controller fails, the surviving controller
continues to serve the failed controller's devices to host
computers, thus obviating shared memory access. The
controller resolves any discrepancies by using the newest
information.

Specific firmware components in a controller can communicate
with the other controller to synchronize special events
between the hardware on both controllers. Some examples of
special events are SCSI-2 bus resets, cache state changes,
and diagnostic tests.

Each controller can sense the presence or absence of its
cache to set up cache diagnostics and cache operations and
can sense the presence or absence of the other controller's
cache for dual-controller setup purposes.

The failover of a controller's cache occurs only if
write-back caching was in use before the controller failure
was detected. In this case, the surviving controller causes
the failed controller's cache to write its information into
the surviving controller's cache. After this is
accomplished, the cache is released and access to the
devices involved is permitted. The cache then awaits the
failed controller's return to the dual-redundant
configuration through reinitialization or replacement.

If portions of the controller buffer and cache memories
fail, the controller continues normal operation. Hardware
error correction in controller memory, coupled with advanced
diagnostic firmware, allows the controller to survive
dynamic and static memory failures. In fact, the controller
will continue to operate even if a cache module fails. For
more information concerning the design and architecture of
HS-series StorageWorks™ array controllers, see Vol. 6, No.
4, Fall 1994 (published 1995) issue of the "Digital
Technical Journal", page 5 et seq.

Referring now to FIG. 4, shown is a diagrammatic
representation of sample data mapping for a RAID level 5
array 50, illustrating two block high data striping (as
opposed to four in FIG. 2) with interleaved parity and left
symmetric rotation, and which also shows the relationship
between the "array management software" 60, the virtual disk
65, and the
member disks 1-5 of the array 50. RAID level 5 in its pure
form was rejected because of its poor write performance for
small write operations. Ultimately chosen was RAID level 5
data mapping (i.e., data striping with interleaved parity,
as illustrated in FIGS. 2 and 4) coupled with dynamic update
algorithms and writeback caching via the redundant caches
23, 24 in the redundant controllers 20 (FIG. 3) to overcome
the small-write penalty.

The Array Management Software 60 manages several disks and
presents them to a host operating environment as virtual
disks with different cost, availability, and performance
characteristics than the underlying members. The Software 60
may execute either in the disk subsystem or in a host
computer. Its principal functions are:

(1) to map the storage space available to applications onto
the array member disks in a way that achieves some
desired balance of cost, availability, and performance,
and,

(2) to present the storage to the operating environment as
one or more virtual disks by transparently converting
I/O requests directed to a virtual disk to I/O operations
on the underlying member disks, and by performing
whatever operations are required to provide any
extraordinary data availability capabilities offered by
the array 50.

Parity RAID array 50 appears to hosts, such as the host
computers 31 and 32, as an economical, fault-tolerant
virtual disk unit such as the virtual disk 65, where the
blocks of data appear consecutively (by their LBN) on the
drive as shown in FIG. 4. A Parity RAID virtual disk unit
with a storage capacity equivalent to that of n disks
requires n+1 physical disks to implement. Data and parity
are distributed (striped) across all disk members in the
array, primarily to equalize the overhead associated with
processing concurrent small write requests.

As has previously been explained with respect to FIG. 1,
if a disk in a Parity RAID array fails, its data can be

entire slice from 0 to 1. Of course if the new bit state and
the old bit state are the same, no change will occur in the
parity. Thus there is no necessity of examining (reading)
any of the other bits of the other LBNs in slice 6 to
determine and write the new parity. Knowing only the state
of the old data, the new data and the old parity, one can
determine the new parity. The new data and the new parity
may be then written to the disk.

The problem of a write hole occurs when writing new or
modified data to an old block and/or writing a new parity
and a crash or power outage occurs and one write is
successful and the other write doesn't take place. This is
when the parity is said to be "inconsistent". The problem in
the industry is that when a crash occurs in the middle of a
write, if there was no recording of what was done, the
standard or conventional manner of proceeding is to examine
or reparity the entire disk and really the entire RAID set.
The purpose of the parity is to be able to reconstruct the
data on a disk when the disk crashes. So if parity is
inconsistent, data cannot be properly reconstructed. If it
is not known which data can be relied upon or if there is
trash in one or more LBNs associated with the inconsistent
parity, then the material may be impossible, without more,
to recover. It should also be noted that the occurrence of a
write hole when a RAIDset is reduced (drive missing and not
replaced) can lead to the loss of data from the missing
member. This is because the missing data is represented as
the Xor of the data from the remaining members and in no
other way. If one write succeeds and one does not, this Xor
will produce invalid data. Thus the present invention not
only allows us to make parity consistent quickly, reducing
the probability of losing data, but prevents loss of data
due to write holes while a RAIDset is reduced.

To eliminate this write hole, designers had to develop a
method of preserving information about ongoing RAID write
operations across power failures such that it could be
conveyed between partner controllers in a dual-redundant
configuration. Non-volatile caching of RAID write opera-
recovered by reading the corresponding blocks on the sur- tions in progress was the manner determined to be of most 
viving disk members and performing a reconstruct, data use to alleviate the problem not only in dual-redundant 
reconstruct or reconstruct read (using exclusive-OR [Xor] 40 configuarions, but in single controller operations, 
operations on data from other members). So one of the problems is to where, if anywhere, the parity 

Recalling that a slice is said to be consistent when parity is inconsistent, and the other problem is if the system crash 

is the "exclusive or" (Xor) of all other bits in that slice, there causes the disk drive to fail, or if the array is reduced, (i.e. 

are three principal algorithms to be considered for modify- one drive missing) it is essential to know that your parity is 

ing data and parity and reconstructing data and parity when 45 consistent so that the reduced, failed etc. drive data may be 

it is desired to modify the same or there is a failure causing reconstructed, merely by Xoring the appropriate or remain* 

the RAID set or a portion thereof to fail. When information ing bit for each remaining member in each slice. In a striped 

is to be written to a disk drive or particular LBN, (1) a Read, RAID 5, such as our example in FIG. 2, (or in FIG. 4) the 

Modify and Write algorithm (RMW) is employed when loss of a drive would of necessity include data and parity, 

parity is consistent and data is being written for some or a 50 For example, if drive 5 in the table of FIG. 2 fails, the parity 

small subset of the members, (2) a Reconstruct Write (RW) for strip 0 and data for strips 1 and 2 will be lost. However, 

algorithm is employed when most or all of the members are the lost data for each slice may be recreated or reconstructed 

being written, and (3) Non-Redundant Write (NRW) is by simply Xoring the remaining LBN's. Another way to 

utilized when when the member or drive containing the look at a write hole is that if all of the writes succeed before 

parity data is missing. 55 a failure, and as long as this is known, then parity will be 

Start with a consistent slice (i.e. parity is correct), and it known to be consistent. The same is true if none of the writes 

is desired to write new data into a block. That means a new commanded are made before the RAID set is interrupted by 

parity for each effected slice must be constructed. Assume a power outage, loss of a drive etc, then the parity will be 

that the RMW algorithm is going to be employed. The consistent. In each of those instances, a write hole problem 

algorithm is the Xor of the old data with the new data, and 60 exists when it is not known whether parity is consistent (that 

the result Xor'd with the old parity to create the new parity. either all writes succeeded or no writes were issued). A write 

By way of example, and referring now to FIG. 5, suppose hole problem thus occurs when updating more than one 

LBN 25 in FIG. 2 was to be modified. Suppose the bit member (drive) and not all of the writes succeed or insuf- 

written in slice 0 of the old data block (LBN 25) is 0, but the ficient knowledge is obtainable so that it is not evident that 

first bit in the modified slice in data block LBN 25 is 1. An 65 a write has or has not succeeded. 

Xor of the old bit (0) and the new bit (1) gives 1 . Xoring the Double failures may also occur. Power failures that cause, 

result (1) with the old parity of 0 changes the parity for the upon resumption of power, a loss of a drive, are considered 
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double failures, its data where ever the system was writing, 
is missing and could be lost if you do not know whether the 
writes were complete, partially complete etc. 
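The effect of a write hole on a reduced RAID set can be illustrated with a small sketch. It is not taken from the specification: the one-byte blocks, the choice of missing member and the helper name `xor_blocks` are assumptions made only for illustration. It shows the RMW parity update (old data Xor new data Xor old parity) and how a parity write that never reaches the disk makes the reconstructed data of a missing member invalid.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise Xor of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent slice: four data members and one parity member.
data = [bytes([0x11]), bytes([0x22]), bytes([0x44]), bytes([0x88])]
parity = xor_blocks(*data)

# RMW update of member 0: only old data, new data and old parity are needed.
new_block = bytes([0x1F])
new_parity = xor_blocks(data[0], new_block, parity)

# Write hole: the data write reached the disk, the parity write did not,
# and member 2 is missing (the RAID set is reduced).
surviving = [new_block, data[1], data[3]]
rebuilt_with_stale_parity = xor_blocks(*surviving, parity)
write_hole_corrupts = rebuilt_with_stale_parity != data[2]

# Once the parity write also completes, reconstruction is correct again.
rebuilt_ok = xor_blocks(*surviving, new_parity)
```

In this sketch the stale parity yields a block that differs from the missing member's real contents, which is the data loss the text attributes to a write hole on a reduced RAIDset.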

In accordance with the invention, all writes and selected
"old information" are first recorded to non-volatile
memory, together with the status of those writes and of the
old data, so that what is known is not only the data, but where
it is going, and the state of the data. Given this information,
parity can be made consistent or data recovered even when a
drive was missing prior to or lost during a power failure that
occurred while RAID writes were in progress.

To this end, and referring now to FIG. 6, which illustrates
an improved flowchart for an RMW (Read, Modify and
Write), the steps include additional information to that
disclosed above with reference to FIG. 5. The steps are as
follows, when a write command is issued:

1. Read old data (OD) into NV cache (23 or 24).

2. Read old (existing) parity (OP) into the NV cache (23
or 24).

3. "Mark" new data (ND) as in the "Write Hole" (WH)
(record) state. (As will be discussed hereinafter, to mark is
to write to a cell which corresponds to a block, i.e., a cache
metadata entry.)

4. "Mark" the old data (OD) as in the "old data" state.

5. Write the new data (ND).

6. Mark the old parity as in the "Old Parity" (OP) state.

7. Write new parity (NP).

8. When both writes succeed, "mark" at least one of the
three blocks as "Parity valid" (PV), i.e., either the "old data"
(OD), the "new data" (ND) or the old parity (OP). This is
done in case an interrupt (failure) occurs, in which case
without more information the controller would think a write
hole had occurred and some of the data had been lost, when in
fact it has not been lost. The reason for the original marking
is so that the new parity can always be found from Xoring the
old data with the new data and then Xoring the result with the
old parity, as was shown in the example of FIG. 5.

9. Null out "marks" for "other" blocks.

10. Null out "mark" for the block containing the parity
valid (PV) mark.
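The ordering of steps 1 to 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the controller firmware: the disk and the NV cache are modelled as Python dictionaries, the marks are plain strings, the names `rmw_write`, `Crash` and `crash_after` are hypothetical, and the PV mark is placed on the old parity block although the text allows any one of the three blocks.

```python
class Crash(Exception):
    """Simulated power failure (hypothetical helper for this sketch)."""


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_write(disk, nv, data_key, parity_key, new_data, crash_after=None):
    """Run steps 1-10; raise Crash once step `crash_after` has completed."""

    def done(step):
        if step == crash_after:
            raise Crash(f"power lost after step {step}")

    nv["ND"] = {"block": new_data, "mark": "WB"}  # dirty data awaiting write
    nv["OD"] = {"block": disk[data_key], "mark": None}
    done(1)
    nv["OP"] = {"block": disk[parity_key], "mark": None}
    done(2)
    nv["ND"]["mark"] = "WH"
    done(3)
    nv["OD"]["mark"] = "OD"
    done(4)
    disk[data_key] = new_data
    done(5)
    nv["OP"]["mark"] = "OP"
    done(6)
    disk[parity_key] = xor(xor(nv["OD"]["block"], new_data), nv["OP"]["block"])
    done(7)
    nv["OP"]["mark"] = "PV"  # assumption: PV is carried by the old parity block
    done(8)
    nv["OD"]["mark"] = None
    nv["ND"]["mark"] = None
    done(9)
    nv["OP"]["mark"] = None
    done(10)


def fresh_disk():
    return {"d0": b"\x0f", "d1": b"\x33", "p": b"\x3c"}  # p == d0 Xor d1


# A write that runs to completion leaves consistent parity and no marks.
disk, nv = fresh_disk(), {}
rmw_write(disk, nv, "d0", "p", b"\xf0")

# A crash after step 5 leaves new data on disk, stale parity, and the marks
# that tell error recovery what was in progress.
crashed_disk, crashed_nv = fresh_disk(), {}
try:
    rmw_write(crashed_disk, crashed_nv, "d0", "p", b"\xf0", crash_after=5)
except Crash:
    pass
```

After the simulated crash the NV cache still holds the old data, the new data and the old parity, with the new data marked WH, so the interrupted write can be identified and repeated.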

The cache 23 and 24 includes a large amount of memory
which may conveniently be divided into blocks, e.g., 512
bytes per block. Starting at the beginning of the cache, no
matter what address it is, consider every 512-byte chunk of
memory as being indexed in an array (the blocks being
arranged and indexed in an array), which may be numbered
consecutively for the number of blocks in the cache. The
index is used to allow indexing into a table to record
information about the contents of the block to which the
index corresponds. Conversely, the table can be scanned to
locate blocks in certain states which require error recovery
in the event of a system failure (i.e., crash).

In this connection, by setting aside 8 bytes of data for each
block on the disk for recording a "cache metadata entry",
which is indexed by block address converted to a unique
integer, the state of any block in the cache may be determined.
With a contiguous array of blocks, each having 512
bytes, this number is derived by simply taking the block
address, subtracting the base of the array, and shifting the
result right by 9 bits (the same as dividing by 512); each
block will then have a unique integer associated therewith.
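The address-to-index conversion just described amounts to a subtraction and a shift. A minimal sketch follows; the function names and the example base address are assumptions for illustration only.

```python
BLOCK_SIZE = 512
BLOCK_SHIFT = 9  # 2**9 == 512, so a right shift by 9 divides by 512


def block_index(block_address: int, cache_base: int) -> int:
    """Convert a cache block address into its metadata table index."""
    offset = block_address - cache_base
    if offset < 0 or offset % BLOCK_SIZE:
        raise ValueError("address is not the start of a cache block")
    return offset >> BLOCK_SHIFT


def block_address(index: int, cache_base: int) -> int:
    """Inverse mapping: table index back to the cache block address."""
    return cache_base + (index << BLOCK_SHIFT)


CACHE_BASE = 0x10000  # hypothetical base address of the block array
third = block_index(CACHE_BASE + 3 * BLOCK_SIZE, CACHE_BASE)
```

Because the mapping is invertible, the same table can be used in both directions: from a block to its metadata entry, and from a scanned metadata entry back to the block it describes.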

In the cache metadata entry, the following is preserved:

(1) The logical block address (LBA) on the disk, which is
the address necessary for the drive controller to fetch or
write the data.

(2) The device index, which indexes another table, which
uniquely identifies the device. Now where the block is
and where the data resides or should reside on the
media is known.

(3) State of the data:

0=unused (free to write)

1=writeback: means that data needs to be written to
the disk at some time in the future.

2=write hole: indicates that this block is dirty data that
needs to be written to a member of a RAID set (either
3 or 5), but we are in the process of writing it to a
RAID set which has parity (RAID sets 0 and 1 don't
have parity). This means that the parity of the particular
slice being written may be inconsistent, if the write
hole is set.

3=this is old data

4=this is old parity: This is really equivalent to 3
because, given the LBA and the device index (see (1)
and (2) above), it is known which member it is and
where it is; given the chunk size, whether this is
parity data or not can be computed.

5=parity valid

Mark means we write to a cell which corresponds to the
block; 1 cell maps to 1 unique block (cache metadata entry).

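One possible shape for such an 8-byte entry is sketched here. The specification gives the three fields and the state values 0 to 5 but not the field widths, so the split used (32-bit LBA, 16-bit device index, 16-bit state, little-endian) and the names `State`, `Entry` and `ENTRY` are assumptions for illustration.

```python
import struct
from enum import IntEnum
from typing import NamedTuple


class State(IntEnum):
    UNUSED = 0
    WRITEBACK = 1
    WRITE_HOLE = 2
    OLD_DATA = 3
    OLD_PARITY = 4
    PARITY_VALID = 5


# Assumed layout: 4-byte LBA, 2-byte device index, 2-byte state = 8 bytes.
ENTRY = struct.Struct("<IHH")


class Entry(NamedTuple):
    lba: int
    device: int
    state: State

    def pack(self) -> bytes:
        return ENTRY.pack(self.lba, self.device, self.state)

    @classmethod
    def unpack(cls, raw: bytes) -> "Entry":
        lba, device, state = ENTRY.unpack(raw)
        return cls(lba, device, State(state))


# "Marking" a block is then a write of one 8-byte cell in the table.
cell = Entry(lba=25, device=1, state=State.WRITE_HOLE).pack()
restored = Entry.unpack(cell)
```

A fixed-size cell per cache block keeps the table directly indexable by the block number derived from the block address.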
When a crash occurs, and the system reboots, this array is
examined by the controller to determine if there is anything
in the write hole state. The first operation that the controller
does is to locate all the entries marked parity valid (PV).
Then it nulls out any metadata of other blocks that fall in the
same slice on the same RAID set (that is what is meant in
step 9 above by nulling out the "marks" for other blocks),
and subsequently nulls out the metadata marked parity valid.

The buffer indexes for the three blocks are also known. One
of them is picked and a "parity valid" is stored into it (in fact
the device index for the parity member, the LBA of the
parity block, and a state which indicates "parity valid" are
stored). Then the other two may be nulled out, and finally the
parity valid indication may be nulled out. Thus if a crash
occurs after this is done, then that slice will not be affected
by the crash unless one of the devices is wiped out and the
data has to be reconstructed. In this manner the metadata
information has been cleared of anything which would
indicate, upon reboot of the system, that there is a problem
to correct.
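The reboot-time cleanup of completed writes can be sketched as a scan over the metadata table. The table layout is an assumption: each entry here carries explicit `raidset` and `slice` fields, whereas in the scheme described they would be derived from the device index, the LBA and the chunk size. The function name `clean_completed_slices` is hypothetical, and only WH, OD and OP marks are nulled, which is one reading of "other blocks".

```python
def clean_completed_slices(table):
    """Null the marks of slices whose writes are known to have completed.

    Returns the entries still in the write hole state, i.e. the writes
    that error recovery must repeat.
    """
    completed = {
        (e["raidset"], e["slice"]) for e in table if e["state"] == "PV"
    }
    # First null the "other" blocks of every slice that has a PV mark ...
    for e in table:
        if (e["raidset"], e["slice"]) in completed and e["state"] in ("WH", "OD", "OP"):
            e["state"] = None
    # ... and only then null the PV marks themselves.
    for e in table:
        if e["state"] == "PV":
            e["state"] = None
    return [e for e in table if e["state"] == "WH"]


table = [
    {"raidset": 0, "slice": 7, "state": "PV"},
    {"raidset": 0, "slice": 7, "state": "WH"},
    {"raidset": 0, "slice": 7, "state": "OD"},
    {"raidset": 0, "slice": 9, "state": "WH"},
    {"raidset": 0, "slice": 9, "state": "OD"},
    {"raidset": 1, "slice": 7, "state": "WB"},
]
pending = clean_completed_slices(table)
```

Nulling the PV mark last means that a second crash during cleanup still finds the PV mark and repeats the cleanup, rather than mistaking a completed write for a write hole.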

Suppose that a crash occurs before parity valid is set but
write hole has been set, i.e., everything is in cache memory
(23 or 24), and both writes are about to be issued. In this
case, the old data, the new data and old parity are all in cache
memory. All that is necessary at that stage is to recompute
the new parity by Xoring the old and new data and then
Xoring the result with the old parity. This would create the
new parity. The key then is to remember what has been done
and in what state the data is.
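A minimal sketch of this recovery, assuming the same dictionary model of disk and NV cache used for illustration elsewhere (the names `recover_rmw`, `nv` and the one-byte blocks are hypothetical): the new parity is recomputed from the three cached blocks and both writes are simply issued again, whichever of them had reached the media before the crash.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def recover_rmw(disk, nv, data_key, parity_key) -> bool:
    """Repeat an interrupted RMW write from the blocks held in NV cache."""
    if nv["ND"]["mark"] != "WH":
        return False  # no write was in progress for this block
    new_parity = xor(xor(nv["OD"]["block"], nv["ND"]["block"]), nv["OP"]["block"])
    disk[data_key] = nv["ND"]["block"]
    disk[parity_key] = new_parity
    return True


# NV cache contents at the time of the crash: old data, new data, old parity.
nv = {
    "ND": {"block": b"\xf0", "mark": "WH"},
    "OD": {"block": b"\x0f", "mark": "OD"},
    "OP": {"block": b"\x3c", "mark": "OP"},
}

# The media may hold any combination of old/new data and old/new parity.
media_states = {
    "neither written": (b"\x0f", b"\x3c"),
    "data only": (b"\xf0", b"\x3c"),
    "parity only": (b"\x0f", b"\xc3"),
    "both written": (b"\xf0", b"\xc3"),
}
outcomes = {}
for name, (d0, p) in media_states.items():
    disk = {"d0": d0, "d1": b"\x33", "p": p}
    recover_rmw(disk, nv, "d0", "p")
    outcomes[name] = disk
```

Because the recomputation uses only cached blocks, the result is the same in all four cases, so recovery does not need to discover which write actually succeeded.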

The write hole state is part of the metadata. The writeback
state must be changed to write hole, which tells us that
parity may be inconsistent and that new data destined for the
specified disk block is present, and where in the cache that
block resides. The other pieces of data from the slice that are
present give the entire picture. What data is present and how
it is marked is a function of the algorithm being used.

Assume that both reads (of old data and old parity) are
issued essentially at the same time. If a crash occurs now,
except that the new data is marked as being dirty and will
eventually have to be written starting at the top of the
algorithm in FIG. 6, there is nothing more to do.

Suppose that the old data read finishes first. Mark the old
data as old data. If a crash occurs now, error recovery will
simply clear the old data in the cache, leaving the old data
still on the media, and this will be known because the new
data hasn't yet been marked as write hole (WH). Suppose
the new data has been marked WH and before marking the
old data a crash occurs. In this instance, a write will be
forced of the new data, but because the old data cannot be
found it will be assumed that the old data is still on the media
(which it is). In this case, missing data may be reconstructed.
Suppose both the old and new data are marked and a write
is issued. If a crash occurs here the old parity is correct on
the media but missing from the cache, but may be fetched
during recovery. Thus, in general, recovery action is dictated
by which blocks from the slice are present in the cache and
their states. The key to rewriting or insuring the proper data
is written to the proper place after a fault is that of knowing
what pieces of data are available, and marking those before an
operation [...].

In RMW, and referring now to FIG. 1, assume
that it is desired to write new data (N1) into LBN 0 in Disk
Drive #1 replacing O1, which was originally in LBN 0.
Further assume that Disk Drive #3 has failed or is missing
(either before or after a power failure) and that data O3 is
unavailable, i.e., it is not resident in cache. Further suppose
that the crash occurred just after issuing writes of the new
parity and new data N1. N1 will be marked as WH (write
hole). You will also have in cache O1 (old data) for LBN 0
as well as OP (old parity) for the previous parity, as read from
disks #1 and #5 respectively. Subsequent to a crash, error
recovery will recompute the new parity as the Xor of O1,
N1, and OP and write it out to the appropriate disks along
with new data N1. After these writes occur, the parity will be
consistent again and error recovery will go through PV steps
to clean up the metadata. It should be recognized that O3
will always be able to be computed unless another disk in the
RAID set is lost. Before the write hole recovery, O3 is the
Xor of O1, O2, O4 and OP. After write hole recovery, O3 is
the Xor of N1, O2, O4 and NP.

One more condition: suppose a spare drive had been put
in for missing Drive #3, but before the write error recovery
takes place. O3 is computed as stated above, i.e., the Xor of
O1, O2, O4 and OP, and then written to the replacement
spare or substitute drive, and the FE bit that corresponds
is cleared.

By way of further example, assume that Disk Drive #3
was present at the time of the power failure, but replaced
with a spare just after reboot (or failover) and all the FE bits
were set to one. This is how the write hole is recovered:
The data in the cache is labelled as N1, O1, and OP. The
data on the disk is O2, O3 and O4. The steps may be
considered as follows:

1. Determine the Metadata FE's for all members (O3 is
the only one set).

2. Read O2 and O4.

3. Xor O1, O2, O4, OP to give O3.

4. Write O3 to Disk Drive #3.

5. Clear FE bit for O3.

6. Xor O1, N1, OP to give NP.

7. Redundantly mark O1 as O1, N1 as WH, and OP as
OP.

8. Write N1 to Disk Drive #1 and NP to Disk Drive #5.

9. Do PV steps of marking at least one of O1, OP or N1
as parity valid (PV).

10. Null out marks for "other blocks" not marked in 9
above, and;

11. Null out mark for block selected as PV in step 9.

In a reconstruct write (RW) the situation is that of
determining which data blocks are present (to be written) for
each data block high slice and taking the following steps:

1. Read into cache all non-parity blocks (i.e., data blocks)
of members for which there is no new data to write.
When complete:

2. Mark all new data blocks with "WH", and all old data
blocks with "Old data". When marked with WH
(write hole), this is saying (if a crash occurs) that a
write was in process; the data on the disk is unknown and
must be treated as suspect.

3. Compute new parity as Xor of all new data blocks and
any data blocks read.

4. Issue writes of new data.

5. Issue write of new parity.

6. [...] mark a selected block as parity valid (PV) [...]
if a crash occurs [...] when the system comes up [...] PV [...]
null out anything in the same slice because it knows
that what was written was complete.

7. Mark "other" blocks null. (Setting to 0 state, unused,
free, etc.)

8. Mark the selected block null (step 6).

For recovery purposes, the important consideration is that
before any writes are done, the old data is marked as well as
the new data. This protects against loss of a drive since all
of the data except parity is available after a crash. Thus,
double failures are handled for this algorithm as well.

StorageWorks™ controller firmware developers have
automated Parity RAID management features rather than
require manual intervention to inhibit write hole problems
after failures. Controller-based automatic array management
is superior to manual techniques because the controller has
the best visibility into array problems and can best manage
any situation given proper guidelines for operation.

Thus the present invention provides a novel method and
apparatus for reconstructing data in a computer system
employing a parity RAID data protection scheme. By
employing a write back cache composed of non-volatile
memory for storing (1) writes outstanding to a device, (2)
selected "old data", and (3) storing metadata information in
the non-volatile memory, it may be determined if, where and
when the write was intended or did happen when the crash
occurred. An examination is made to determine whether
parity is consistent across the slice, and if not, the data in the
non-volatile write back cache is used to reconstruct the write
that was occurring when the crash occurred to insure consistent
parity, so that only those blocks affected by the crash
have to be reconstructed.

Although the invention has been described with a certain
degree of particularity, it should be recognized that elements
thereof may be altered by person(s) skilled in the art without
departing from the spirit and scope of the invention as
hereinafter set forth in the following claims.

What is claimed is:

1. A method of reconstructing data in a computer system
employing a Parity RAID protection scheme for a striped
array of storage devices that employ parity recovery in the
event of a crash, said computer system including a write
back cache composed of non-volatile memory for storing (1)
write data outstanding that is to be written to storage
devices, and (2) metadata information; said metadata information
comprising a first field containing an LBA of said
write data outstanding, a second field containing device IDs
that correspond to said write data outstanding, and a third
field containing status that indicates consistent or inconsistent
write slice parity, comprising the steps of:

storing old data in said non-volatile memory from storage
devices that are intended for said write data
outstanding, to protect said old data in the event a crash
occurs during a write to a storage device;

storing old parity that corresponds to said old data in said
non-volatile memory;

determining from said metadata information where a
given write data outstanding was intended when a crash
occurs;

determining whether parity is consistent across a write
slice corresponding to said given write data
outstanding, and if parity is not consistent, using said
old data stored in said non-volatile memory and said
the old parity stored in said non-volatile memory to
reconstruct said given write data outstanding to thereby
insure consistent parity, whereby;

only slices of said given write data outstanding whose
parity is not consistent and are affected by the crash
have to be reconstructed.

2. A method in accordance with claim 1 including the step
of:

storing said write data outstanding in said non-volatile
memory; and

marking said stored write data outstanding with indicia
indicating that said stored write data outstanding is in
a write hole state.

3. A method in accordance with claim 2, including the step
of:

marking said old data stored in said non-volatile memory
as being in an old data state.

4. A method in accordance with claim 3, including the step
of:

writing said write data outstanding to a designated storage
device.

5. A method in accordance with claim 4, wherein said step
of writing said write data outstanding to a designated storage
device occurs subsequent to said step of storing said write
data outstanding in said non-volatile memory.

6. A method in accordance with claim 5, including the
steps of:

calculating a new parity for said write data outstanding;
and

writing said new parity to a storage device.

7. A method in accordance with claim 6 including the step
of:

subsequent to writing said write data outstanding and said
new parity to a storage device, marking at least one of
said old data, said write data outstanding or said old
parity as parity valid.

8. A method in accordance with claim 7, including the step
of:

subsequent to said step of marking as parity valid, nulling
out marks for other blocks that were not marked as
parity valid by the step of claim 7.

9. A method in accordance with claim 8, including the step
of:

subsequent to the step set forth in claim 8, nulling out said
parity valid mark of claim 7.

10. For use in connection with a computer system, a
method of reconstructing data in a Parity RAID protection
scheme for attachment to said computer system, a write back
cache composed of non-volatile memory for storing (1)
write data that is outstanding to a storage device and
associated old data read from a storage device, and (2)
metadata information; said metadata information comprising
a first field containing a LBA of said outstanding write
data, a second field containing a storage device ID corresponding
to said outstanding write data, and a third field
containing a parity consistent/inconsistent status of data in
said non-volatile memory, comprising the steps of:

determining from said metadata information a storage
device for which said outstanding write data is
intended;

storing in said non-volatile memory associated old data
from at least areas on storage devices intended for
outstanding write data, to protect said associated old
data in the event a crash occurs during a write of said
outstanding write data to a storage device;

determining whether parity is consistent across a data
slice that includes said LBA, and if not using data
stored in said non-volatile memory to reconstruct a
write of said outstanding write data that was in progress
when a crash occurred to insure consistent parity,
whereby;

only data slices that have inconsistent parity and that are
affected by a crash have to be reconstructed.

11. A method in accordance with claim 10 including the
steps of:

marking said write data that is outstanding to a storage
device with indicia indicating that said write data that
is outstanding to a storage device is in a write hole state.

12. A method in accordance with claim 11 including the
step of:

marking said associated old data storing in said non-volatile
memory as being in an old data state.

13. A method in accordance with claim 12, including the
step of:

writing said write data that is outstanding to a storage
device to a designated storage device.

14. A method in accordance with claim 13, including the
steps of:

calculating a new parity for said write data that is outstanding
to a storage device; and

writing said new parity to a storage device.

15. A method in accordance with claim 14 including the
step of:

subsequent to said step of writing said write data that is
outstanding to a storage device to a designated storage
device, and said step of writing said new parity to a
storage device, marking at least one of said write data
that is outstanding to a storage device or said new data
as being parity valid.

16. A method in accordance with claim 15, including the
step of:

subsequent to said parity valid marking step as set forth in
claim 15, marking null blocks that are not marked
parity valid.

17. A method in accordance with claim 16, including the
step of:

subsequent to said marking null step as set forth in claim
16, nulling said parity valid marks.

18. In a computer system employing a Parity RAID
protection scheme, and an array set of disk drives connected
to said computer system, said disk drives containing data
written thereon in a predetermined and proscribed format of
data blocks:

a controller having a write back cache composed of
non-volatile memory, said non-volatile memory being
disposed intermediate said array set of disk drives and
said computer system;

said non-volatile memory storing (1) data block writes
outstanding to a disk drive, (2) associated data read
from a disk drive, and (3) metadata information concerning
information contained on said disk drives;

said metadata information comprising a first field containing
an LBA of the associated data read, a second
field containing a disk drive ID, and a third field
containing a block status of data blocks contained on
said disk drives;

a program contained within said controller for determining
from said metadata information where a write is
intended prior to a write being made to a disk drive of
said array set; and

means in said controller for determining whether parity is
consistent across a slice of multiple data blocks written
on a disk drive, and if not using data stored in said
non-volatile memory to reconstruct a write to insure
consistent parity, whereby reconstruction of data slices
is necessary only for those slices on said set of disks
drives for which inconsistent parity is determined.

19. In combination:

a computer having a central processor; 

an array of disk drives comprising computer-readable disk
memory having a surface formed with a plurality of
binary patterns constituting an application program that
is executable by said computer;

a controller connected intermediate said array of disk 
drives and said central processor; 

said disk drive array in conjunction with said controller 
employing a Parity RAID protection scheme; 
a non-volatile cache contained within said controller;

said controller effecting reading and writing of data
blocks from and to said array of disk drives;

said application program including instructions for a
method of reconstructing data slices in said disk drive
array;

said application program including instructions for read- 
ing into said non-volatile cache (1) writes outstanding 
to a disk drive, (2) associated data read, and (3) 
metadata information concerning said disk drive array; 

said metadata information comprising a first field con- 
taining LBAs of data blocks, a second field containing 
disk drive IDs, and a third field containing data block 
status; 

said application program including instructions for deter- 
mining from said metadata information to which disk 
drive a write is intended; 

said application program including instruction for determining
whether parity is consistent across a data block
slice, and if not using data in said non-volatile cache to
construct or to reconstruct write data, when necessary,
to insure consistent parity, whereby only data block
slices affected by inconsistent parity will have to be
constructed or reconstructed.

***** 